Apparatus and method for communicating messages between data processing nodes using remote reading of message queues

ABSTRACT

A multi-nodal data processing system in which each node has a local memory for storing message send vectors, one for each other node in the system. When a node has a message to send, it places the message in the message send vector corresponding to the destination node of that message. When a node is ready to receive messages, it reads messages from the message send vectors corresponding to this node in the other nodes. Each message send vector has a head pointer and a tail pointer for defining the head and tail of a queue of messages. Each tail pointer is held locally, in the same node as the message send vector to which it relates, while the head pointer is held in the destination node of that message send vector.

BACKGROUND TO THE INVENTION

This invention relates to multi-nodal data processing systems. More specifically, the invention is concerned with providing a mechanism for communicating messages between the nodes of such a system.

One way of communicating messages between nodes is for the sending node to transmit the messages to the receiving node over an inter-node network. A problem with this, however, is that the receiving node may become overloaded with messages received from other nodes, and as a result messages may be lost.

An object of the present invention is to provide an improved message passing mechanism that does not suffer from this problem.

SUMMARY OF THE INVENTION

According to the invention there is provided a data processing system comprising a plurality of data processing nodes, wherein each node comprises:

(a) local memory means for storing a plurality of message send vectors, one for each other node in the system,

(b) message send means for placing messages in said message send vectors, each message being placed in the message send vector corresponding to the destination node of that message, and

(c) message receive means for reading messages from the message send vectors corresponding to this node in the other nodes.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a multi-nodal data processing system including a message passing mechanism in accordance with the invention.

FIG. 2 is a schematic block diagram showing data structures used by the message passing mechanism.

FIG. 3 is a flow chart showing the operation of the message passing mechanism in one node when it requires to pass a message to another node.

FIG. 4 is a flow chart showing the operation of the message passing mechanism in one node when it is ready to receive a message from another node.

DESCRIPTION OF AN EMBODIMENT OF THE INVENTION

One embodiment of the invention will now be described by way of example with reference to the accompanying drawings.

Referring to FIG. 1, the system comprises a plurality of data processing nodes 10. Each node includes one or more data processing elements 11, and one or more local memory modules 12.

The nodes are interconnected by an inter-node network 13, which allows the nodes to send messages to each other. The nodes are also all connected to an input/output (I/O) network 14, which allows the nodes to access a number of disk controllers 15 and communications controllers 16.

Referring now to FIG. 2, the local memory 12 in each node contains a number of message send vectors 20, held in predetermined locations of the memory. Each vector contains a number of message slots, for holding a queue of messages for a particular one of the other nodes. Thus, message send vector j in node i holds messages from node i for node j. The message size is fixed, and is preferably a multiple of the cache line size of the system. For example, if the cache line size is 32 bytes, the message size may typically be 128 or 256 bytes. The messages are aligned with the cache lines. Each message has a checksum value associated with it, for detecting transmission errors.
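By way of illustration only, the fixed-size, checksummed message format described above might be sketched as follows. The 32-byte cache line and 128-byte message size are the example values given in the text. The use of a CRC-32, its storage in the last four bytes of the slot, and the names pack_message, checksum_ok and slot_offset are assumptions made for this sketch and are not part of the described embodiment.

```python
import zlib

CACHE_LINE_SIZE = 32   # bytes, example value from the text
MESSAGE_SIZE = 128     # bytes, a multiple of the cache line size
CHECKSUM_SIZE = 4      # assumption: a 32-bit CRC kept in the last 4 bytes
PAYLOAD_SIZE = MESSAGE_SIZE - CHECKSUM_SIZE

# The message size must be a whole number of cache lines.
assert MESSAGE_SIZE % CACHE_LINE_SIZE == 0


def pack_message(payload: bytes) -> bytes:
    """Pad the payload to the fixed size and append its checksum."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload does not fit in one message slot")
    body = payload.ljust(PAYLOAD_SIZE, b"\x00")
    return body + zlib.crc32(body).to_bytes(CHECKSUM_SIZE, "big")


def checksum_ok(message: bytes) -> bool:
    """Return True if the stored checksum matches the message body."""
    if len(message) != MESSAGE_SIZE:
        return False
    body, stored = message[:PAYLOAD_SIZE], message[PAYLOAD_SIZE:]
    return zlib.crc32(body).to_bytes(CHECKSUM_SIZE, "big") == stored


def slot_offset(slot_index: int) -> int:
    """Byte offset of a slot in a vector that starts on a cache line boundary."""
    return slot_index * MESSAGE_SIZE
```

Because the message size is a multiple of the cache line size, every slot offset computed in this way falls on a cache line boundary, provided the vector itself starts on one.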

Each message send vector 20 has a tail pointer 22 associated with it, pointing to the next available message slot in this vector, and a head pointer 24, pointing to the first message queued in this vector. The tail pointer is held locally; that is, each tail pointer is held in the same local memory as the message send vector to which it relates. The head pointers, on the other hand, are held remotely; that is, each head pointer is held in the local memory of the destination node of the messages in the message send vector. Thus, head pointer j in node i points to the first available message in message send vector i of node j.

The local memory in each node also holds message receive vectors 26. These are used, as will be described, to hold local copies of messages received from the other nodes.
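The placement of the pointers may be easier to follow from a small model. The following is a minimal sketch, assuming a vector length of eight slots (the text does not give one); the class name Node, its fields and the helper queue_length are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

SLOTS_PER_VECTOR = 8  # assumption: the text does not specify a vector length


@dataclass
class Node:
    node_id: int
    num_nodes: int
    # send_vectors[j]: message slots holding messages from this node for node j
    send_vectors: dict = field(default_factory=dict)
    # tail[j]: next available slot in send_vectors[j]; held locally with the vector
    tail: dict = field(default_factory=dict)
    # head[i]: first queued message in the send vector that node i keeps for
    # this node; held here, in the destination node, not in node i
    head: dict = field(default_factory=dict)

    def __post_init__(self):
        for other in range(self.num_nodes):
            if other != self.node_id:
                self.send_vectors[other] = [None] * SLOTS_PER_VECTOR
                self.tail[other] = 0
                self.head[other] = 0


def queue_length(nodes, i, j):
    """Number of messages queued from node i to node j.

    The tail pointer comes from node i's memory and the head pointer
    from node j's memory, reflecting where each pointer is held.
    """
    return (nodes[i].tail[j] - nodes[j].head[i]) % SLOTS_PER_VECTOR
```

In this model, a queue from node i to node j is empty exactly when nodes[i].tail[j] equals nodes[j].head[i].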

Referring now to FIG. 3, when a node (node i) has a message for sending to another node (node j), it performs the following actions. First, node i checks whether there is a free message slot in message send vector j in its local memory, by comparing the head and tail pointers for that vector. Assuming that there is at least one free message slot, node i writes the message into the first available message slot in the message send vector, as indicated by the tail pointer. Then, node i updates the tail pointer, ie increments it by one. (Each message send vector is organized as a circular queue, so that incrementing the head or tail pointer beyond the end of the vector returns it to the start of the vector.) Finally, node i checks whether the queue has just changed from being empty to being non-empty, ie whether there is now exactly one message in the queue. If so, an interrupt signal is sent to node j to inform it that a message is now available for it to read.
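The send sequence of FIG. 3 can be sketched as below. This is an illustration under stated assumptions rather than the embodiment itself: the vector length of four slots is arbitrary, one slot is kept unused so that a full queue can be told apart from an empty one (the text does not say how this is done), and the current head pointer value is simply passed in as an argument, since the text does not spell out how node i obtains it from node j. The names SendVector, send and raise_interrupt are hypothetical.

```python
SLOTS = 4  # assumption: the text does not specify a vector length


class SendVector:
    """Message send vector j in node i, with its locally held tail pointer."""

    def __init__(self):
        self.slots = [None] * SLOTS
        self.tail = 0  # next available slot


def send(vector, head, message, raise_interrupt):
    """Queue a message for the destination node.

    head is the current head pointer value for this vector (held by the
    destination node). Returns False if there is no free slot.
    """
    next_tail = (vector.tail + 1) % SLOTS  # circular queue wrap-around
    if next_tail == head:
        # Assumption: one slot stays empty to distinguish full from empty.
        return False
    was_empty = vector.tail == head
    vector.slots[vector.tail] = message    # write into the slot at the tail
    vector.tail = next_tail                # then advance the tail pointer
    if was_empty:
        # Queue went from empty to exactly one message: notify node j.
        raise_interrupt()
    return True
```

Note that the interrupt is raised only on the transition from empty to non-empty, so a burst of messages produces a single interrupt rather than one per message.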

Referring now to FIG. 4, when a node (node j) is ready to receive a message from node i, it performs the following actions. First, node j performs a remote read of the memory in node i, so as to read the value of the tail pointer for message send vector j in node i. Node j then compares this tail pointer value with the corresponding head pointer for that vector, which is held in its local memory, to see whether that vector contains any messages. If the head pointer value is not equal to the tail pointer value, this means that there is at least one message for node j in message send vector j of node i. Node j therefore proceeds to process all outstanding messages in the queue as follows.

Node j performs a remote read of the memory in node i, so as to read the first queued message from message send vector j of that node, ie the message pointed to by the head pointer. The message is copied into the message receive vector in node j corresponding to node i. Node j then performs a checksum test on the message, to see whether the message has been correctly received. If the checksum test fails, node j makes another attempt to copy the message from node i. If the failure still persists after a predetermined number of retries, the system is shut down. Assuming that the checksum test is passed, node j updates the head pointer for that vector, held in its local memory (ie increments it by one), so as to point to the next queued message. An appropriate message handler is then called, to process the current message.

The steps described in the preceding paragraph are repeated until it is found that the head and tail pointers are equal, indicating that the queue is now empty, ie there are no more messages waiting to be processed.

In summary, it can be seen that the message passing mechanism described above allows messages to be passed between nodes without the necessity for any writes to each other's local memories. When a node has a message to send, it simply writes the message to the appropriate message send vector in its local memory, and this message will then be read by the destination node, using a remote memory read.
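The receive sequence of FIG. 4 can be sketched as follows. This is a minimal sketch under several assumptions: remote reads are modelled as method calls on an object standing in for node i's memory, the tail pointer is read once per call, the "predetermined number of retries" is taken to be three, a CRC-32 stands in for the unspecified checksum, and shutting the system down is modelled by raising an exception. All names (SenderMemory, make_message, receive_all, MAX_RETRIES) are hypothetical.

```python
import zlib

SLOTS = 4        # assumption: vector length is not given in the text
MAX_RETRIES = 3  # assumption: the "predetermined number" is not given


class SenderMemory:
    """Stands in for node i's local memory as seen by node j via remote reads."""

    def __init__(self):
        self.slots = [None] * SLOTS
        self.tail = 0

    def remote_read_tail(self):
        return self.tail

    def remote_read_slot(self, index):
        return self.slots[index]


def make_message(payload):
    """Pair a payload with its checksum (CRC-32 is an assumption)."""
    return (payload, zlib.crc32(payload))


def receive_all(sender, head, receive_vector, handler):
    """Process all outstanding messages; return the updated head pointer."""
    tail = sender.remote_read_tail()
    while head != tail:                       # queue not yet empty
        for _ in range(1 + MAX_RETRIES):      # first attempt plus retries
            message = sender.remote_read_slot(head)
            receive_vector[head] = message    # local copy of the message
            payload, checksum = message
            if zlib.crc32(payload) == checksum:
                break
        else:
            # Models the system shut-down described in the text.
            raise RuntimeError("checksum failure persisted after retries")
        head = (head + 1) % SLOTS             # point to the next queued message
        handler(payload)                      # process the current message
    return head
```

The head pointer is advanced only after the checksum test has been passed, so a message that fails the test remains at the head of the queue for the retry.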

We claim:
1. A data processing system comprising: (a) a first data processing node, including first memory means for holding a first queue of messages, (b) a second data processing node, including second memory means for holding a second queue of messages, and (c) an inter-node network interconnecting said first data processing node to said second data processing node, (d) said first data processing node further comprising: (i) first message send means for writing messages, destined for said second data processing node, into said first queue of messages, and (ii) first message receive means for performing remote reads of said second memory means, by way of said inter-node network, to read messages from said second queue of messages, (e) and said second data processing node further comprising: (i) second message send means for writing messages, destined for said first data processing node, into said second queue of messages, and (ii) second message receive means for performing remote reads of said first memory means, by way of said inter-node network, to read messages from said first queue of messages.
2. A data processing system comprising: (a) a first data processing node, including first memory means for holding a first queue of messages, (b) a second data processing node, including second memory means for holding a second queue of messages, and (c) an inter-node network interconnecting said first data processing node to said second data processing node, (d) said first data processing node further comprising: (i) first tail pointer means for pointing to a tail location in said first queue of messages, (ii) first head pointer means for pointing to a head location in said second queue of messages, (iii) first message send means for using said first tail pointer means to write a message, destined for said second data processing node, into said tail location in said first queue of messages, and (iv) first message receive means for performing a remote read of said second memory means, by way of said inter-node network, using said first head pointer means, to read a message from said head location in said second queue of messages, and (e) said second data processing node further comprising: (i) second tail pointer means for pointing to a tail location in said second queue of messages, (ii) second head pointer means for pointing to a head location in said first queue of messages, (iii) second message send means for using said second tail pointer means to write a message, destined for said first data processing node, into said tail location in said second queue of messages, and (iv) second message receive means for performing a remote read of said first memory means, by way of said inter-node network, using said second head pointer means, to read a message from said head location in said first queue of messages.
3. A data processing system comprising: (a) a first data processing node, (b) a plurality of further data processing nodes, and (c) an inter-node network interconnecting said first data processing node to each of said further data processing nodes, (d) said first data processing node comprising: (i) memory means for holding a plurality of queues of messages, said queues of messages being respectively associated with said further data processing nodes, and (ii) message send means for writing messages, destined for said further data processing nodes, into respective ones of said queues of messages, (e) and each of said further data processing nodes comprising message receive means for performing remote reads of said memory means, by way of said inter-node network, to read messages from a respective one of said queues of messages.
4. A data processing system comprising: (a) a first data processing node, (b) a plurality of further data processing nodes, and (c) an inter-node network interconnecting said first data processing node to each of said further data processing nodes, (d) said first data processing node comprising: (i) memory means for holding a plurality of queues of messages, said queues of messages being respectively associated with said further data processing nodes, (ii) tail pointer means for pointing to tail locations in said queues of messages, and (iii) message send means for using said tail pointer means to write messages, destined for said further data processing nodes, into said tail locations of respective ones of said queues of messages, (e) and each of said further data processing nodes comprising: (i) head pointer means for pointing to a head location in each of said queues of messages, and (ii) message receive means for performing remote reads of said memory means, by way of said inter-node network, using said head pointer means, to read messages from said head location in said respective one of said queues of messages.
5. A method of operating a data processing system comprising a first data processing node, including first memory means for holding a first queue of messages, a second data processing node, including second memory means for holding a second queue of messages, and an inter-node network interconnecting said first data processing node to said second data processing node, said method comprising the steps: (i) operating said first data processing node to write messages, destined for said second data processing node, into said first queue of messages, (ii) operating said first data processing node to perform remote reads of said second memory means, by way of said inter-node network, to read messages from said second queue of messages, (iii) operating said second data processing node to write messages, destined for said first data processing node, into said second queue of messages, and (iv) operating said second data processing node to perform remote reads of said first memory means, by way of said inter-node network, to read messages from said first queue of messages.
6. A method of operating a data processing system comprising a first data processing node, including memory means for holding a plurality of queues of messages, a plurality of further data processing nodes, each of said further data processing nodes being associated with a respective one of said queues of messages, and an inter-node network interconnecting said first data processing node to each of said plurality of further data processing nodes, said method comprising the steps: (i) operating said first data processing node to write messages, destined for said further data processing nodes, into respective ones of said queues of messages, and (ii) operating said further data processing nodes to perform remote reads of said memory means, by way of said inter-node network, to read messages from respective ones of said queues of messages.